NSF PAR Search | NSF Public Access Repository


G-Thinkerq: A General Subgraph Querying System With a Unified Task-Based Programming Model

https://doi.org/10.1109/TKDE.2025.3537964

Yuan, Lyuheng; Guo, Guimu; Yan, Da; Adhikari, Saugat; Khalil, Jalal; Long, Cheng; Zou, Lei (June 2025, IEEE Transactions on Knowledge and Data Engineering)

Free, publicly-accessible full text available June 1, 2026
EvaNet: Elevation-Guided Flood Extent Mapping on Earth Imagery

Sami, Mirza Tanzim; Yan, Da; Adhikari, Saugat; Yuan, Lyuheng; Han, Jiao; Jiang, Zhe; Khalil, Jalal; Zhou, Yang (October 2024, IJCAI)

Accurate and timely mapping of flood extent from high-resolution satellite imagery plays a crucial role in disaster management such as damage assessment and relief activities. However, current state-of-the-art solutions are based on U-Net, which cannot segment the flood pixels accurately due to the ambiguous pixels (e.g., tree canopies, clouds) that prevent a direct judgement from only the spectral features. Thanks to the digital elevation model (DEM) data readily available from sources such as the United States Geological Survey (USGS), this work explores the use of an elevation map to improve flood extent mapping. We propose EvaNet, an elevation-guided segmentation model based on the encoder-decoder architecture with two novel techniques: (1) a loss function encoding the physical law of gravity that if a location is flooded (resp. dry), then its adjacent locations with a lower (resp. higher) elevation must also be flooded (resp. dry); (2) a new (de)convolution operation that integrates the elevation map by a location-sensitive gating mechanism to regulate how much spectral feature information flows through adjacent layers. Extensive experiments show that EvaNet significantly outperforms the U-Net baselines, and works as a perfect drop-in replacement for U-Net in existing solutions to flood extent mapping. EvaNet is open-sourced at https://github.com/MTSami/EvaNet
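The gravity rule in technique (1) can be illustrated with a small penalty function. This is only a sketch under stated assumptions, not EvaNet's actual loss: the function name `gravity_violation`, the 4-neighbour adjacency, and the hinge form of the penalty are all hypothetical choices made here for illustration. It takes a grid of predicted flood probabilities and a grid of elevations, and penalizes every adjacent pair in which the higher cell is predicted to be more flooded than the lower one.

```python
def gravity_violation(prob, elev):
    """Mean hinge penalty over 4-adjacent cell pairs (illustrative only).

    prob: 2D list of predicted flood probabilities in [0, 1].
    elev: 2D list of elevations with the same shape.
    A pair is penalized when the higher cell is predicted more flooded
    than its lower neighbour, which contradicts the gravity rule.
    """
    rows, cols = len(prob), len(prob[0])
    total, pairs = 0.0, 0
    for r in range(rows):
        for c in range(cols):
            # Look only right and down so each adjacent pair is visited once.
            for nr, nc in ((r, c + 1), (r + 1, c)):
                if nr >= rows or nc >= cols:
                    continue
                pairs += 1
                if elev[r][c] > elev[nr][nc]:
                    p_high, p_low = prob[r][c], prob[nr][nc]
                elif elev[r][c] < elev[nr][nc]:
                    p_high, p_low = prob[nr][nc], prob[r][c]
                else:
                    continue  # equal elevation: the rule imposes nothing
                total += max(0.0, p_high - p_low)
    return total / pairs if pairs else 0.0
```

A prediction that marks a high cell as flooded and its lower neighbour as dry gets a positive penalty, while the reverse assignment gets none. The "flooded implies lower neighbours flooded" and "dry implies higher neighbours dry" statements are contrapositives of each other, which is why a single hinge term per pair covers both.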
Full Text Available
FSM-Explorer: An Interactive Tool for Frequent Subgraph Pattern Mining from a Big Graph

Khalil, Jalal; Yan, Da; Yuan, Lyuheng; Han, Jiao; Adhikari, Saugat; Long, Cheng; Zhou, Yang (April 2024, 40th IEEE International Conference on Data Engineering)

In this demonstration paper, we describe FSM-Explorer, an interactive tool that makes it easier for end-users to mine frequent subgraph patterns from a big graph G, and to explore the subgraph instances in G that match the patterns. FSM-Explorer supports not only the popular MNI support measure, but also the recently proposed Fraction-Score measure, which is more accurate. Its backend engine is built on top of the recent T-FSM system that ensures high concurrency, bounded memory consumption, and effective load balancing. Using real-world data, we showcase how users can mine frequent subgraph patterns by parameter tuning in FSM-Explorer, and how they can conveniently examine the many matched instances in G one batch at a time to improve productivity.
T-FSM: A Task-Based System for Massively Parallel Frequent Subgraph Pattern Mining from a Big Graph

https://doi.org/10.1145/3588928

Yuan, Lyuheng; Yan, Da; Qu, Wenwen; Adhikari, Saugat; Khalil, Jalal; Long, Cheng; Wang, Xiaoling (May 2023, Proceedings of the ACM on Management of Data)

Finding frequent subgraph patterns in a big graph is an important problem with many applications such as classifying chemical compounds and building indexes to speed up graph queries. Since this problem is NP-hard, some recent parallel systems have been developed to accelerate the mining. However, they often have a huge memory cost, very long running time, suboptimal load balancing, and possibly inaccurate results. In this paper, we propose an efficient system called T-FSM for parallel mining of frequent subgraph patterns in a big graph. T-FSM adopts a novel task-based execution engine design to ensure high concurrency, bounded memory consumption, and effective load balancing. It also supports a new anti-monotonic frequentness measure called Fraction-Score, which is more accurate than the widely used MNI measure. Our experiments show that T-FSM is orders of magnitude faster than SOTA systems for frequent subgraph pattern mining. Our system code has been released at https://github.com/lyuheng/T-FSM.
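For readers unfamiliar with the MNI measure mentioned above, the following sketch shows how it is commonly defined: for each pattern vertex, count the distinct data-graph vertices it is mapped to across all matches, then take the minimum of those counts. The abstract does not spell this out, so the function name `mni_support` and the representation of matches as dictionaries are assumptions made for illustration. Fraction-Score is not shown because the abstract does not define it.

```python
def mni_support(embeddings):
    """Minimum image-based (MNI) support of a pattern (illustrative sketch).

    embeddings: list of dicts, each mapping a pattern vertex to the
    data-graph vertex it is matched to in one subgraph instance.
    """
    if not embeddings:
        return 0
    images = {}  # pattern vertex -> set of distinct data-graph vertices
    for emb in embeddings:
        for pattern_vertex, graph_vertex in emb.items():
            images.setdefault(pattern_vertex, set()).add(graph_vertex)
    # The support is limited by the pattern vertex with the fewest images.
    return min(len(vertex_set) for vertex_set in images.values())
```

For a single-edge pattern a-b matched three times around one hub, `mni_support([{"a": 1, "b": 2}, {"a": 1, "b": 3}, {"a": 1, "b": 4}])` returns 1, because pattern vertex `a` only ever maps to data vertex 1.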
Full Text Available
An elevation-guided annotation tool for flood extent mapping on earth imagery (demo paper)

https://doi.org/10.1145/3557915.3560962

Adhikari, Saugat; Yan, Da; Sami, Mirza Tanzim; Khalil, Jalal; Yuan, Lyuheng; Joy, Bhadhan Roy; Jiang, Zhe; Sainju, Arpan Man (November 2022, SIGSPATIAL '22: Proceedings of the 30th International Conference on Advances in Geographic Information Systems)

Full Text Available
Parallel mining of large maximal quasi-cliques

https://doi.org/10.1007/s00778-021-00712-2

Khalil, Jalal; Guo, Guimu; Yuan, Lyuheng (January 2022, The VLDB journal)

Given a user-specified minimum degree threshold γ, a γ-quasi-clique is a subgraph where each vertex connects to at least γ fraction of the other vertices. Quasi-clique is a natural definition for dense structures, so finding large and hence statistically significant quasi-cliques is useful in applications such as community detection in social networks and discovering significant biomolecule structures and pathways. However, mining maximal quasi-cliques is notoriously expensive, and even a recent algorithm for mining large maximal quasi-cliques is flawed and can lead to a lot of repeated searches. This paper proposes a parallel solution for mining maximal quasi-cliques that is able to fully utilize CPU cores. Our solution utilizes divide and conquer to decompose the workloads into independent tasks for parallel mining, and we address the problems of (i) drastic load imbalance among different tasks and (ii) difficulty in predicting the task running time and the time growth with task subgraph size, by (a) using a timeout-based task decomposition strategy, and by (b) utilizing a priority task queue to schedule long-running tasks earlier for mining and decomposition to avoid stragglers. Unlike our conference version in PVLDB 2020 where the solution was built on a distributed graph mining framework called G-thinker, this paper targets a single-machine multi-core environment which is more accessible to an average end user. A general framework called T-thinker is developed to facilitate the programming of parallel programs for algorithms that adopt divide and conquer, including but not limited to our quasi-clique mining algorithm. Additionally, we consider the problem of directly mining large quasi-cliques from dense parts of a graph, where we identify the repeated search issue of a recent method and address it using a carefully designed concurrent trie data structure.
Extensive experiments verify that our parallel solution scales well with the number of CPU cores, achieving 26.68× runtime speedup when mining a graph with 3.77M vertices and 16.5M edges with 32 mining threads. Additionally, mining large quasi-cliques from dense parts can provide an additional speedup of up to 89.46×.
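The combination of timeout-based task decomposition and a priority task queue described in this abstract can be sketched on a toy divide-and-conquer problem. This is not T-thinker's API or implementation: the function `count_subsets`, the use of a step budget in place of a wall-clock timeout, the use of "remaining items" as a proxy for expected running time, and the single-threaded loop are all assumptions chosen to keep the example small and deterministic. The toy task counts the subsets of a list that sum to a target by walking a set-enumeration tree.

```python
import heapq

def count_subsets(items, target, budget=8):
    """Count subsets of `items` summing to `target` using budgeted tasks.

    Returns (number_of_subsets, number_of_tasks_run). `budget` must be >= 1.
    """
    # Heap entries: (-remaining_items, seq, start_index, partial_sum).
    # Negating the size makes the largest (likely longest) task pop first.
    heap = [(-len(items), 0, 0, 0)]
    seq = 1          # tie-breaker so heap entries always compare cleanly
    found = 0
    tasks_run = 0
    while heap:
        _, _, start, partial = heapq.heappop(heap)
        tasks_run += 1
        steps = 0
        stack = [(start, partial)]  # depth-first search within one task
        while stack:
            i, s = stack.pop()
            if steps >= budget:
                # Budget exhausted: hand the pending branch back to the
                # queue as an independent task instead of continuing.
                heapq.heappush(heap, (-(len(items) - i), seq, i, s))
                seq += 1
                continue
            steps += 1
            if i == len(items):
                if s == target:
                    found += 1
                continue
            stack.append((i + 1, s))             # branch: skip items[i]
            stack.append((i + 1, s + items[i]))  # branch: take items[i]
    return found, tasks_run
```

With a small budget the search is split into several tasks, and with a very large budget it runs as a single task; the count of matching subsets is the same in both cases. In a multi-threaded setting, worker threads would pop from the shared queue concurrently, which is where splitting long tasks helps avoid stragglers.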
Full Text Available
Maximal Directed Quasi-Clique Mining

Guo, Guimu; Yan, Da; Yuan, Lyuheng; Khalil, Jalal; Long, Cheng; Jiang, Zhe; Zhou, Yang (January 2022, Proceedings of the 38th IEEE International Conference on Data Engineering (ICDE))

Quasi-cliques are a type of dense subgraphs that generalize the notion of cliques, important for applications such as community/module detection in various social and biological networks. However, the existing quasi-clique definition and algorithms are only applicable to undirected graphs. In this paper, we generalize the concept of quasi-cliques to directed graphs by proposing $$(\gamma_1, \gamma_2)$$-quasi-cliques which have density requirements in both inbound and outbound directions of each vertex in a quasi-clique subgraph. An efficient recursive algorithm is proposed to find maximal $$(\gamma_1, \gamma_2)$$-quasi-cliques which integrates many effective pruning rules that are validated by ablation studies. We also study the finding of top-$$k$$ large quasi-cliques directly by bootstrapping the search from more compact quasi-cliques, to scale the mining to larger networks. The algorithms are parallelized with effective load balancing, and we demonstrate that they can scale up effectively with the number of CPU cores.
Full Text Available
G-thinker: a general distributed framework for finding qualified subgraphs in a big graph with load balancing

https://doi.org/10.1007/s00778-021-00688-z

Yan, Da; Guo, Guimu; Khalil, Jalal; Özsu, M. Tamer; Ku, Wei-Shinn; Lui, John C. (January 2022, The VLDB journal)

Finding from a big graph those subgraphs that satisfy certain conditions is useful in many applications such as community detection and subgraph matching. These problems have a high time complexity, but existing systems that attempt to scale them are all IO-bound in execution. We propose the first truly CPU-bound distributed framework called G-thinker for subgraph finding algorithms, which adopts a task-based computation model, and which also provides a user-friendly subgraph-centric vertex-pulling API for writing distributed subgraph finding algorithms that can be easily adapted from existing serial algorithms. To utilize all CPU cores of a cluster, G-thinker features (1) a highly concurrent vertex cache for parallel task access and (2) a lightweight task scheduling approach that ensures high task throughput. These designs effectively overlap communication with computation to minimize the idle time of CPU cores. To further improve load balancing on graphs where the workloads of individual tasks can be drastically different due to biased graph density distribution, we propose to prioritize the scheduling of those tasks that tend to be long running for processing and decomposition, plus a timeout mechanism for task decomposition to prevent long-running straggler tasks. The idea has been integrated into a novel algorithm for maximum clique finding (MCF) that adopts a hybrid task decomposition strategy, which significantly improves the running time of MCF on dense and large graphs: The algorithm finds a maximum clique of size 1,109 on a large and dense WikiLinks graph dataset in 70 minutes. Extensive experiments demonstrate that G-thinker achieves orders of magnitude speedup compared even with the fastest existing subgraph-centric system, and it scales well to much larger and denser real network data. G-thinker is open-sourced at http://bit.ly/gthinker with detailed documentation.
Full Text Available
Scalable Mining of Maximal Quasi-Cliques: An Algorithm-System Codesign Approach

https://doi.org/10.14778/3436905.3436916

Guo, Guimu; Yan, Da; Özsu, M. Tamer; Jiang, Zhe; Khalil, Jalal (January 2020, Proceedings of the VLDB Endowment)
Given a user-specified minimum degree threshold γ, a γ-quasi-clique is a subgraph g = (V_g, E_g) where each vertex v ∈ V_g connects to at least γ fraction of the other vertices (i.e., ⌈γ · (|V_g| - 1)⌉ vertices) in g. Quasi-clique is one of the most natural definitions for dense structures useful in finding communities in social networks and discovering significant biomolecule structures and pathways. However, mining maximal quasi-cliques is notoriously expensive. In this paper, we design parallel algorithms for mining maximal quasi-cliques on G-thinker, a distributed graph mining framework that decomposes mining into compute-intensive tasks to fully utilize CPU cores. We found that directly using G-thinker results in the straggler problem due to (i) the drastic load imbalance among different tasks and (ii) the difficulty of predicting the task running time. We address these challenges by redesigning G-thinker's execution engine to prioritize long-running tasks for execution, and by utilizing a novel timeout strategy to effectively decompose long-running tasks to improve load balancing. While this system redesign applies to many other expensive dense subgraph mining problems, this paper verifies the idea by adapting the state-of-the-art quasi-clique algorithm, Quick, to our redesigned G-thinker. Extensive experiments verify that our new solution scales well with the number of CPU cores, achieving 201× runtime speedup when mining a graph with 3.77M vertices and 16.5M edges in a 16-node cluster.
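The degree condition in the definition above translates directly into a membership check. The sketch below is only an illustration of the definition, not the Quick algorithm or any part of G-thinker; the function name `is_quasi_clique` and the adjacency-dictionary representation are assumptions. The threshold is computed with exact rational arithmetic, since a floating-point product such as 0.7 × 10 can land slightly above the true value and push the ceiling one too high.

```python
import math
from fractions import Fraction

def is_quasi_clique(adj, vertices, gamma):
    """Check whether `vertices` induces a gamma-quasi-clique.

    adj: dict mapping each vertex to the set of its neighbours
         (undirected graph).
    gamma: threshold in (0, 1], e.g. 0.6 or Fraction(3, 5).
    """
    vs = set(vertices)
    # Parse gamma via its string form so that 0.7 means exactly 7/10.
    need = math.ceil(Fraction(str(gamma)) * (len(vs) - 1))
    # Each vertex must have at least `need` neighbours inside the subgraph
    # (a vertex is never counted as its own neighbour).
    return all(len((adj[v] & vs) - {v}) >= need for v in vs)
```

On a 4-cycle with vertices 1 to 4, every vertex has 2 of the 3 other vertices as neighbours, so the whole cycle passes for γ = 0.6 (threshold ⌈1.8⌉ = 2) and fails for γ = 0.7 (threshold ⌈2.1⌉ = 3).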
Full Text Available
